Bioinformatics An Introduction 4th Edition (Jeremy Ramsden)

The Nature of Information

yes/no questions required to ascertain the number of different kinds of objects or to

identify the kind of any object chosen from the set. ³

The Shannon Index. The formula that we used to determine the quantity $I$ of information delivered by a measurement that fixes the result as one out of $n$ equally likely possibilities, each having a probability $p_i$, $i = 1, \ldots, n$, all equal to $1/n$, was

$$I = -\log p = \log n . \tag{6.4}$$

It is called Hartley’s formula. If the base of the logarithm is 2, then the formula

yields numerical values in bits. Where the probabilities of the different alternatives

are not equal, then a weighted mean must be taken:

$$I = -\sum_{i=1}^{n} p_i \log_2 p_i . \tag{6.5}$$

This generalization is called the Shannon or Shannon–Wiener index. In other words,

the quantity of information is weighted logarithmic variety. Note that the quantity

of information given by Eq. (6.5) never exceeds that given by the equiprobable case (6.4), the two being equal only when all the $p_i$ equal $1/n$. This follows from Jensen’s inequality.⁴
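As a minimal numerical sketch of Eqs. (6.4) and (6.5), assuming base-2 logarithms so that the results are in bits, the following Python computes both quantities. The function names `hartley_information` and `shannon_information` are illustrative and do not come from the text. Outcomes with zero probability are skipped, on the usual convention that $p \log p \to 0$ as $p \to 0$.

```python
import math


def hartley_information(n: int) -> float:
    """Eq. (6.4): information in bits for n equally likely outcomes."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return math.log2(n)


def shannon_information(probs: list[float]) -> float:
    """Eq. (6.5): weighted mean -sum(p_i * log2(p_i)), in bits."""
    if any(p < 0 for p in probs):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Skip zero-probability outcomes (assumed convention: 0 * log 0 = 0).
    return -sum(p * math.log2(p) for p in probs if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.5, 0.25, 0.125, 0.125]

print(hartley_information(4))        # equiprobable case, Eq. (6.4)
print(shannon_information(uniform))  # coincides with Eq. (6.4)
print(shannon_information(skewed))   # does not exceed log2(4)
```

For the uniform distribution the two formulas agree, while the skewed distribution over the same four outcomes yields a smaller value.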

Why is the negative of the sum taken? $I$ in fact represents the gain of information due to the measurement. In general,

$$\text{gain (in something)} = \text{final value} - \text{initial value} . \tag{6.7}$$

The initial value represents the uncertainty in the outcome prior to the measurement.

Shannon takes the final value (i.e., the result of the measurement) to be a single

3 This primitive notion of variety is related to the diversity measured by biometricians concerned with assessing the variety of species in an ecosystem (biocoenosis). Diversity $D$ is essentially variety weighted according to the relative abundances (i.e., probability $p_i$ of occurrence) of the $N$ different types, and this can be done in different ways. Parameters in use by practitioners include

$$D_0 = N \quad \text{(no weighting)}, \tag{6.1}$$

$$D_1 = \exp(I) \quad \text{(the exponential of Shannon’s index, Eq. 6.5)}, \tag{6.2}$$

$$D_2 = 1 \Big/ \sum_{i=1}^{N} p_i^2 \quad \text{(the reciprocal of Simpson’s index)}. \tag{6.3}$$
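The three diversity parameters can be sketched in Python as follows. This is an illustration only: the function name `diversity_indices` is hypothetical, and $D_1$ is computed with natural logarithms, on the assumption that $\exp(I)$ in Eq. (6.2) presumes $I$ expressed in natural units (with $I$ in bits the equivalent would be $2^I$). Types with zero abundance are assumed not to count towards $N$.

```python
import math


def diversity_indices(probs: list[float]) -> tuple[int, float, float]:
    """Return (D0, D1, D2) for relative abundances that sum to 1."""
    if any(p < 0 for p in probs):
        raise ValueError("relative abundances must be non-negative")
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("relative abundances must sum to 1")
    # Assumption: types with zero abundance are treated as absent.
    present = [p for p in probs if p > 0]
    d0 = len(present)                                      # Eq. (6.1): no weighting
    shannon_nats = -sum(p * math.log(p) for p in present)  # Eq. (6.5), natural logs
    d1 = math.exp(shannon_nats)                            # Eq. (6.2)
    d2 = 1.0 / sum(p * p for p in present)                 # Eq. (6.3)
    return d0, d1, d2


# Four equally abundant types: all three measures equal the number of types.
print(diversity_indices([0.25, 0.25, 0.25, 0.25]))

# One dominant type: D1 and D2 fall below D0.
print(diversity_indices([0.7, 0.1, 0.1, 0.1]))
```

The more unequal the abundances, the further $D_1$ and $D_2$ drop below the unweighted count $D_0$.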

4 If $g(x)$ is a convex function on an interval $(a, b)$, if $x_1, x_2, \ldots, x_n$ are arbitrary real numbers with $a < x_k < b$, and if $w_1, w_2, \ldots, w_n$ are positive numbers with $\sum_{k=1}^{n} w_k = 1$, then

$$g\left(\sum_{k=1}^{n} w_k x_k\right) \le \sum_{k=1}^{n} w_k\, g(x_k) . \tag{6.6}$$

Inequality (6.6) is then applied to the convex function $y = x \log x$ $(x > 0)$ with $x_k = p_k$ and $w_k = 1/n$ $(k = 1, 2, \ldots, n)$ to get $I(p_1, p_2, \ldots, p_n) \le \log n$.
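Spelled out, the application of Jensen's inequality runs as follows. This is a sketch in which the base of the logarithm is left unspecified and $\sum_{k} p_k = 1$ is used in the first line; here $g(x) = x \log x$, $x_k = p_k$, and $w_k = 1/n$.

$$
\begin{aligned}
g\Big(\sum_{k=1}^{n} \tfrac{1}{n}\, p_k\Big) &= g\big(\tfrac{1}{n}\big) = \tfrac{1}{n} \log \tfrac{1}{n} = -\tfrac{1}{n} \log n , \\
\sum_{k=1}^{n} \tfrac{1}{n}\, g(p_k) &= \tfrac{1}{n} \sum_{k=1}^{n} p_k \log p_k = -\tfrac{1}{n}\, I(p_1, p_2, \ldots, p_n) , \\
-\tfrac{1}{n} \log n \le -\tfrac{1}{n}\, I(p_1, p_2, \ldots, p_n) \quad &\Longrightarrow \quad I(p_1, p_2, \ldots, p_n) \le \log n .
\end{aligned}
$$

Because $x \log x$ is strictly convex, equality holds only when all the $p_k$ are equal, that is, $p_k = 1/n$.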